Corpus Specific Stop Words to Improve the Textual Analysis in Scientometrics

نویسندگان

  • Vicenç Parisi Baradad
  • Alexis-Michel Mugabushaka
چکیده

With the availability of vast collection of research articles on internet, textual analysis is an increasingly important technique in scientometric analysis. While the context in which it is used and the specific algorithms implemented may vary, typically any textual analysis exercise involves intensive pre-processing of input text which includes removing topically uninteresting terms (stop words). In this paper we argue that corpus specific stop words, which take into account the specificities of a collection of texts, improve textual analysis in scientometrics. We describe two relatively simple techniques to generate corpus-specific stop words; stop words lists following a Poisson distribution and keyword adjacency stop words lists. In a case study to extract keywords from scientific abstracts of research project funded by the European Research Council in the domain of Life sciences, we show that a combination of those techniques gives better recall values than standard stop words or any of the two techniques alone. The method we propose can be implemented to obtain stop words lists in an automatic way by using author provided keywords for a set of abstracts. The stop words lists generated can be updated easily by adding new texts to the training corpus. Conference Topic Methods and techniques

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Use of Lexical Bundles in Native and Non-native Post-graduate Writing: The Case of Applied Linguistics MA Theses

Connor et al. (2008) mention “specifying textual requirements of genres” (p.12) as one of the reasons which have motivated researchers in the analysis of writing. Members of each genre should be able to produce and retrieve these textual requirements appropriately to be considered communicatively proficient. One of the textual requirements of genres is regularities of specific forms and content...

متن کامل

A Corpus-Based Study of the Lexical Make-up of Applied Linguistics Article Abstracts

This paper reports results from a corpus-based study that explored the frequency of words in the abstracts of applied linguistics journal articles. The abstracts of major articles in leading applied linguists journals, published since 2005 up to November 2001 were analyzed using software modules from the Compleat Lexical Tutor. The output includes a list of the most frequent content words, list...

متن کامل

Sociocultural Identity in TEFL Textbooks: A Systemic Functional ‎Analysis

This study aimed at investigating shades of identity in TEFL textbooks. Most identity studies have focused on authors as knowledge producers. They have neglected authors' roles in constructing identity. Further, few scholars have considered disciplinary specific textbooks in their analyses of identity. Trying to bridge these gaps, we applied Halliday’s Systemic Functional Linguistics to investi...

متن کامل

Do We Need Discipline-Specific Academic Word Lists? Linguistics Academic Word List (LAWL)

This corpus-based study aimed at exploring the most frequently-used academic words in linguistics and compare the wordlist with the distribution of high frequency words in Coxhead’s Academic Word List (AWL) and West’s General Service List (GSL) to examine their coverage within the linguistics corpus. To this end, a corpus of 700 linguistics research articles (LRAC), consisting of approximately ...

متن کامل

Conceptual Metaphoric Language Use in Structuring Political Discourse in Iran-West Relations: A CDA Perspective

The present study was carried out with the purpose of examining the role of metaphorical language in the critical discourse analysis (CDA) of political texts based on a modern framework postulated by Kövecses (2015). The corpus of the study consisted of thirty-thousand words chosen as a textual sample to see which source conceptual domains are used and what generic/discursive attributes emerge ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015